Writing Functions

Monday, May 13

Today we will…

  • Remember: Group Formation Survey due tonight!
  • New Material
    • Function Basics
    • Variable Scope + Environment
  • PA 7: Writing Functions

Why write functions?

Functions allow you to automate common tasks!

  • We’ve been using functions since Day 1, but when we write our own, we can customize them!
  • Have you found yourself copy-pasting code and only changing small parts?

Writing functions has 3 big advantages over copy-paste:

  1. Your code is easier to read.
  2. To change your analysis, simply change one function.
  3. You avoid mistakes from copy-paste.

Function Basics

Function Syntax


A (very) Simple Function

Let’s define the function.

  • You only need to run the code that defines the function once per R session.
add_two <- function(x){
  x + 2
}


Let’s call the function!

add_two(5)
[1] 7

Naming: add_two <-

The name of the function is chosen by the author.

add_two <- function(x){
  x + 2
}

Caution: Function names have no inherent meaning.

  • The name you give to a function does not affect what the function does.
add_three <- function(x){
  x + 7
}
add_three(5)
[1] 12

Arguments

The argument(s) of the function are chosen by the author.

  • Arguments are how we pass external values into the function.
  • They are temporary variables that only exist inside the function body.
  • We give them general names:
    • x, y, z – vectors
    • df – dataframe
    • i, j – indices


add_two <- function(x){
  x + 2
}

Arguments

If we supply a default value when defining the function, the argument is optional when calling the function.

add_something <- function(x, something = 2){
  return(x + something)
}
  • If a value is not supplied, something defaults to 2.
add_something(x = 5)
[1] 7
add_something(x = 5, something = 6)
[1] 11

If we do not supply a default value when defining the function, the argument is required when calling the function.

add_something <- function(x, something){
  x + something
}

add_something(x = 2)
Error in add_something(x = 2): argument "something" is missing, with no default

Body: { }

The body of the function is where the action happens.

  • The body must be specified within a set of curly brackets.
  • The code in the body will be executed (in order) whenever the function is called.
add_two <- function(x){
  x + 2
}

Output: return()

By default, a function returns the value of the last expression evaluated in its body (what would normally print out).

add_two <- function(x){
  x + 2
}


7 + 2
[1] 9
add_two(7)
[1] 9


…but it’s better to be explicit and use return().

add_two <- function(x){
  return(x + 2)
}

Output: return()

If you need to return more than one object from a function, wrap those objects in a list.

min_max <- function(x){
  lowest <- min(x)
  highest <- max(x)
  return(list(lowest, highest))
}

vec <- c(346, 547, 865, 346, 6758, 78, 79, 362)
min_max(vec)
[[1]]
[1] 78

[[2]]
[1] 6758
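
A possible refinement, sketched under the assumption that we are free to name the list elements: with names, the caller can pull out each piece with $ or [[ ]] instead of remembering positions. The function name min_max_named() and the element names lowest and highest are illustrative choices, not part of the original example.

```r
# Variation on min_max(): the returned list has named elements.
# (min_max_named, lowest, and highest are illustrative names.)
min_max_named <- function(x){
  stopifnot(is.numeric(x))
  return(list(lowest = min(x), highest = max(x)))
}

vec <- c(346, 547, 865, 346, 6758, 78, 79, 362)
result <- min_max_named(vec)

result$lowest        # 78
result[["highest"]]  # 6758
```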

Input Validation

When a function requires an input of a specific data type, check that the supplied argument is valid.

add_something <- function(x, something){
  stopifnot(is.numeric(x))
  return(x + something)
}

add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): is.numeric(x) is not TRUE
add_something <- function(x, something){
  if(!is.numeric(x)){
    stop("Please provide a numeric input for the x argument.")
  }
  return(x + something)
}

add_something(x = "statistics", something = 5)
Error in add_something(x = "statistics", something = 5): Please provide a numeric input for the x argument.
add_something <- function(x, something){
  if(!is.numeric(x) || !is.numeric(something)){
    stop("Please provide numeric inputs for both arguments.")
  }
  return(x + something)
}

add_something(x = 2, something = "R")
Error in add_something(x = 2, something = "R"): Please provide numeric inputs for both arguments.
add_something <- function(x, something){
  stopifnot(is.numeric(x), is.numeric(something))
  return(x + something)
}

add_something(x = "statistics", something = "R")
Error in add_something(x = "statistics", something = "R"): is.numeric(x) is not TRUE
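
One way to combine the brevity of stopifnot() with readable messages, assuming R 4.0.0 or later: name each condition, and the name is used as the error message. The function name add_something_checked() and the message wording below are hypothetical.

```r
# Sketch: named conditions in stopifnot() become the error messages.
# Assumes R >= 4.0.0; add_something_checked is a hypothetical name.
add_something_checked <- function(x, something){
  stopifnot(
    "`x` must be numeric" = is.numeric(x),
    "`something` must be numeric" = is.numeric(something)
  )
  return(x + something)
}

add_something_checked(x = 2, something = 5)

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  add_something_checked(x = 2, something = "R"),
  error = function(e) conditionMessage(e)
)
msg
```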

Variable Scope + Environment

Variable Scope

The location (environment) in which we can find and access a variable is called its scope.

  • We need to think about the scope of variables when we write functions.
  • What variables can we access inside a function?
  • What variables can we access outside a function?

Global Environment

  • The top right pane of RStudio shows you the global environment.
    • This is the current state of all objects you have created.
    • These objects can be accessed anywhere.

Function Environment

  • The code inside a function executes in the function environment.
    • Function arguments and any variables created inside the function only exist inside the function.
      • They disappear when the function code is complete.
    • What happens in the function environment does not affect things in the global environment.

Function Environment

We cannot access variables created inside a function outside of the function.

add_two <- function(x) {
  my_result <- x + 2
  return(my_result)
}

add_two(9)
[1] 11
my_result
Error in eval(expr, envir, enclos): object 'my_result' not found

Name Masking

Name masking occurs when an object in the function environment has the same name as an object in the global environment.

add_two <- function(x) {
  my_result <- x + 2
  return(my_result)
}
my_result <- 2000

The my_result created inside the function is different from the my_result created outside.

add_two(5)
[1] 7
my_result
[1] 2000

Dynamic Lookup

Functions look for objects FIRST in the function environment and SECOND in the environment where the function was defined (usually the global environment).

  • If the object doesn’t exist in either, the code will give an error.
add_two <- function() {
  return(x + 2)
}

add_two()
Error in add_two(): object 'x' not found
x <- 10

add_two()
[1] 12

It is not good practice to rely on global environment objects inside a function!
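
A short sketch of why this matters, using two hypothetical versions of the function: add_two_global() depends on whatever x happens to be in the global environment, while add_two() depends only on its argument.

```r
x <- 10

# Relies on a global `x`: the result changes whenever `x` changes.
add_two_global <- function() {
  return(x + 2)
}

# Takes the value as an argument: the result depends only on the input.
add_two <- function(x) {
  return(x + 2)
}

add_two_global()  # 12
x <- 100
add_two_global()  # 102, same call but a different answer
add_two(10)       # 12, regardless of the global `x`
```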

Debugging

(Illustration by Allison Horst)

Debugging

You will make mistakes (create bugs) when coding.

  • Unfortunately, it becomes more and more complicated to debug your code as your code gets more sophisticated.
  • This is especially true with functions!

Debugging Strategies

  • Interactive coding
    • Highlight lines within your function and run them one-by-one to see what happens.
  • print() debugging
    • Add print() statements throughout your code to make sure the values are what you expect.
  • Rubber Ducking
    • Verbally explain your code line by line to a rubber duck (or a human).
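
A minimal sketch of print() debugging, using a hypothetical function mean_abs_dev() that is not part of the course materials. The print() calls expose the intermediate values so each step can be checked against what we expect; they would be removed once the function works.

```r
# Hypothetical function: the average distance of each value from the mean.
mean_abs_dev <- function(x) {
  center <- mean(x)
  print(center)              # check the intermediate value

  deviations <- abs(x - center)
  print(deviations)          # check the vector before summarizing

  return(mean(deviations))
}

mean_abs_dev(c(1, 2, 3, 10))
```

For this input, the printed center is 4, the printed deviations are 3 2 1 6, and the returned value is 3.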

Debugging Strategies

When you have a concept that you want to turn into a function…

  1. Write a simple example of the code without the function framework.

  2. Generalize the example by assigning variables.

  3. Write the code into a function.

  4. Call the function on the desired arguments.

This structure allows you to address issues as you go.

An Example

Write a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).

  • find_car_make("Toyota Camry") should return “Toyota”.
  • find_car_make("Ford Anglica") should return “Ford”.

An Example

make <- str_extract(string = "Toyota Camry",
                    pattern = "[:alpha:]*")
make
[1] "Toyota"
make <- str_extract(string = "Ford Anglica",
                    pattern = "[:alpha:]*")
make
[1] "Ford"
car_name <- "Toyota Camry"

make <- str_extract(string = car_name, 
                    pattern = "[:alpha:]*")
make
[1] "Toyota"
find_car_make <- function(car_name){
  make <- str_extract(string = car_name, 
                      pattern = "[:alpha:]*")
  return(make)
}
find_car_make("Toyota Camry")
[1] "Toyota"
find_car_make("Ford Anglica")
[1] "Ford"
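
Because str_extract() works on a whole character vector at once, the function can also be called on several car names in one go. A short sketch, assuming the stringr package is installed; the vector car_names is made up for illustration.

```r
library(stringr)

find_car_make <- function(car_name){
  make <- str_extract(string = car_name,
                      pattern = "[:alpha:]*")
  return(make)
}

# One call, several inputs: the result has one element per car name.
car_names <- c("Toyota Camry", "Ford Anglica", "Mazda RX4")
find_car_make(car_names)
```

This returns "Toyota" "Ford" "Mazda", which is what lets the function be used on an entire column of a data frame.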

PA 7: Writing Functions

You will write a few small functions and use them to unscramble a message!

To do…

  • PC1: Group Project Formation Survey
    • Due TODAY – Monday, 5/13 at 11:59pm.
  • PA 7: Writing Functions
    • Due Wednesday, 5/15 at 10:00am.

Wednesday, May 15

Today we will…

  • Discuss Group Project + Group Contract
  • New Material
    • Calling Functions on Datasets
    • Thinking About Missing Data
  • Lab 7: Functions and Fish

Group Project Details

Check out the Canvas page outlining the group project!


  • Groups have been assigned.
  • Your group contract is due on Monday!

Calling Functions on Datasets

Last Time…

We wrote a function called find_car_make() that takes in the name of a car and returns the “make” of the car (the company that created it).

  • find_car_make("Toyota Camry") returns “Toyota”.
  • find_car_make("Ford Anglica") returns “Ford”.
find_car_make <- function(car_name){
  make <- str_extract(string = car_name, 
                      pattern = "[:alpha:]*")
  return(make)
}

Pair Our Function with dplyr

Consider the mtcars data.

data(mtcars)
head(mtcars, n = 3)
               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Let’s use our new function:

mtcars |> 
  rownames_to_column("make_model") |> 
  mutate(make = find_car_make(make_model),
         .after = make_model) |> 
  head(n = 3)
     make_model   make  mpg cyl disp  hp drat    wt  qsec vs am gear carb
1     Mazda RX4  Mazda 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 Mazda RX4 Wag  Mazda 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3    Datsun 710 Datsun 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Recall the penguins Data

library(palmerpenguins)
data(penguins)
penguins |> 
  head()
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>

Function to Standardize Data

We want to take in a vector of numbers and standardize it: rescale the values so they all fall between 0 and 1.

std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  
  return(num / denom)
}
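
A quick check of the function on small made-up vectors. Each value is rescaled as (value - min) / (max - min), so the smallest value maps to 0 and the largest to 1. The second call shows an edge case worth knowing about: when every value is the same, the denominator is 0.

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))

  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)

  return(num / denom)
}

# Missing values stay missing; the rest are rescaled.
std_to_01(c(2, 4, NA, 10))   # 0.00 0.25 NA 1.00

# All values equal: 0 / 0 gives NaN for every element.
std_to_01(c(5, 5, 5))        # NaN NaN NaN
```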

Standardizing Variables

Is it a good idea to standardize (scale) variables in a data analysis?

Why standardize?

  • Easier to compare across variables.
  • Easier to model – standardizes the amount of variability.

Why not standardize?

  • More difficult to interpret the values.

E.g., a penguin with a bill length of 35 mm (std to 0.11) and a mass of 5500 g (std to 0.78).

Pair Our Function with dplyr

Let’s standardize penguin measurements.

penguins |> 
  mutate(bill_length_mm    = std_to_01(bill_length_mm), 
         bill_depth_mm     = std_to_01(bill_depth_mm), 
         flipper_length_mm = std_to_01(flipper_length_mm), 
         body_mass_g       = std_to_01(body_mass_g))
  • Ugh. Still copy-pasting!

Recall across()!

penguins |> 
  mutate(across(.cols = bill_length_mm:body_mass_g,
                .fns = ~ std_to_01(.x))) |> 
  slice_head(n = 4)
# A tibble: 4 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen          0.255         0.667             0.153       0.292
2 Adelie  Torgersen          0.269         0.512             0.237       0.306
3 Adelie  Torgersen          0.298         0.583             0.390       0.153
4 Adelie  Torgersen         NA            NA                NA          NA    
# ℹ 2 more variables: sex <fct>, year <int>

Use variables as function arguments?

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

Note

I used the existing function std_to_01() inside the new function for clarity!

But it didn’t work…

std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found

Tidy Evaluation

Functions using unquoted variable names as arguments are said to use nonstandard evaluation or tidy evaluation.

Tidy:

penguins |> 
  pull(body_mass_g)

  OR

penguins$body_mass_g

Untidy:

penguins[, "body_mass_g"]

  OR

penguins[["body_mass_g"]]


Tidy evaluation isn’t naturally supported when writing your own functions.

Defused R Code

When a piece of code is defused, R doesn’t return its value like normal.

  • Instead it returns an expression that describes how to evaluate it.

Evaluated code:

1 + 1
[1] 2

Defused code:

expr(1 + 1)
1 + 1

We produce defused code when we use tidy evaluation and our own functions don’t know how to handle it.
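
A small sketch of the difference, assuming the rlang package is installed. The defused expression can be stored and inspected like any other object, and it is only computed when we explicitly ask for it with eval(). The object name defused is an illustrative choice.

```r
library(rlang)

defused <- expr(1 + 1)   # stores the code itself, not its value

class(defused)           # "call": it is still unevaluated code
eval(defused)            # 2: now the code is actually run
```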

Solution 1

Don’t use tidy evaluation in your own functions.

  • This is more complicated to read and use, but it’s safe.
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

std_column_01(penguins, "bill_length_mm")
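
To see the result without loading the penguins data, here is the same idea on a tiny made-up data frame. The extra checks (that variable is a string naming an existing column) are an assumption about what a careful version might validate, not part of the original function.

```r
std_to_01 <- function(var) {
  stopifnot(is.numeric(var))
  num <- var - min(var, na.rm = TRUE)
  denom <- max(var, na.rm = TRUE) - min(var, na.rm = TRUE)
  return(num / denom)
}

# The column is named with a string, so no tidy evaluation is involved.
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data),
            is.character(variable),
            variable %in% names(data))   # added checks (an assumption)

  data[[variable]] <- std_to_01(data[[variable]])
  return(data)
}

toy <- data.frame(id = 1:3, mass = c(10, 15, 20))   # made-up data
std_column_01(toy, "mass")$mass   # 0.0 0.5 1.0
```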

Solution 2: rlang

Use the rlang package!

  • This package provides operators that simplify writing functions around tidyverse pipelines.

  • Read more about using this package for function writing here!

Solution 2: rlang

Two ways to get around the issue of defused code:

  1. Embrace Operator ({{ }})
    • With {{ }}, you can transport a variable from one function to another.
  2. Defuse and Inject
    • You can first use enquo(arg) to defuse the variable.
    • Then use !!arg to inject the variable.

Solution 2: rlang

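If the variable we inject is the name of the column being created or modified (the left-hand side), we also need to use the walrus operator (:=).

  • This means we write {{ variable }} := ... (or !!variable := ...) instead of using = inside the dplyr verb; ordinary argument names such as .cols still use =.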
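
A small sketch of where := matters, assuming dplyr is installed. The helper names double_column() and add_doubled() and the toy tibble are hypothetical. When the injected variable names the column on the left-hand side, := is required; when it only appears on the right-hand side, the ordinary = still works.

```r
library(dplyr)

# Injected variable on BOTH sides: the left-hand side needs `:=`.
double_column <- function(data, variable) {
  data |>
    mutate({{ variable }} := 2 * {{ variable }})
}

# Injected variable only on the right-hand side: plain `=` is fine.
add_doubled <- function(data, variable) {
  data |>
    mutate(doubled = 2 * {{ variable }})
}

toy <- tibble(a = 1:3, b = c(10, 20, 30))   # made-up data

double_column(toy, b)   # column b becomes 20, 40, 60
add_doubled(toy, a)     # new column doubled is 2, 4, 6
```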

Recall Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(variable = std_to_01(variable))
  return(data)
}

std_column_01(penguins, body_mass_g)
Error in `mutate()`:
ℹ In argument: `variable = std_to_01(variable)`.
Caused by error:
! object 'body_mass_g' not found
  • The code is defused, so mutate() doesn’t know what body_mass_g is.
  • We need to modify variable to make this work correctly!

Fixing Our Broken Function

std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))

  data <- data |>
    mutate({{variable}} := std_to_01({{variable}}))
  return(data)
}

std_column_01(penguins, body_mass_g)
# A tibble: 6 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen           39.1          18.7       0.292 male    2007
2 Adelie  Torgersen           39.5          17.4       0.306 female  2007
3 Adelie  Torgersen           40.3          18         0.153 female  2007
4 Adelie  Torgersen           NA            NA        NA     <NA>    2007
5 Adelie  Torgersen           36.7          19.3       0.208 female  2007
6 Adelie  Torgersen           39.3          20.6       0.264 male    2007
std_column_01 <- function(data, variable) {
  stopifnot(is.data.frame(data))
  
  variable <- enquo(variable)

  data <- data |>
    mutate(!!variable := std_to_01(!!variable))
  return(data)
}

std_column_01(penguins, body_mass_g)
# A tibble: 6 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen           39.1          18.7       0.292 male    2007
2 Adelie  Torgersen           39.5          17.4       0.306 female  2007
3 Adelie  Torgersen           40.3          18         0.153 female  2007
4 Adelie  Torgersen           NA            NA        NA     <NA>    2007
5 Adelie  Torgersen           36.7          19.3       0.208 female  2007
6 Adelie  Torgersen           39.3          20.6       0.264 male    2007

Inject Multiple Variables

What if I want to modify multiple columns?

  • Use across()!
std_column_01 <- function(data, variables) {
  stopifnot(is.data.frame(data))
  
  data <- data |> 
    mutate(across(.cols = {{variables}},
                  .fns = ~ std_to_01(.x)))
  return(data)
}

std_column_01(penguins, bill_length_mm:body_mass_g)
# A tibble: 5 × 7
  species island    bill_length_mm bill_depth_mm body_mass_g sex     year
  <fct>   <fct>              <dbl>         <dbl>       <dbl> <fct>  <int>
1 Adelie  Torgersen          0.255         0.667       0.292 male    2007
2 Adelie  Torgersen          0.269         0.512       0.306 female  2007
3 Adelie  Torgersen          0.298         0.583       0.153 female  2007
4 Adelie  Torgersen         NA            NA          NA     <NA>    2007
5 Adelie  Torgersen          0.167         0.738       0.208 female  2007

Missing Data

Types of Missing Data

  1. Missing Completely at Random (MCAR)
    • No difference between missing and observed values.
    • Missing observations are a random subset of all observations.
  2. Missing at Random (MAR)
    • Systematic difference between missing and observed values, but can be entirely explained by other observed variables.
  3. Missing Not at Random (MNAR)
    • Missingness is directly related to the unobserved value.

Types of Missing Data

Consider a study of depression.

  1. Missing Completely at Random (MCAR)
    • Some subjects have missing lab values because a batch of samples was processed improperly.
  2. Missing at Random (MAR)
    • Subjects who identify as men are less likely to complete a survey on depression severity.
  3. Missing Not at Random (MNAR)
    • Subjects with more severe depression are less likely to complete a survey on depression severity.

When we remove missing data…

We implicitly assume observations are missing completely at random!

  • We might be mostly removing data from subjects who identify as men.
  • We might be mostly removing data from subjects with severe depression.
  • We are inadvertently making our data less representative.

We need to take more care when dealing with missing values!
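
A simulated sketch of why the assumption matters. All numbers here are made up: severity stands in for a depression score, the 30% missingness rate and the logistic missingness rule are arbitrary choices, and the seed is only there for reproducibility. Dropping MCAR values leaves the mean roughly unchanged, while dropping MNAR values (where higher scores are more likely to be missing) pulls the mean down.

```r
set.seed(331)  # arbitrary seed so the sketch is reproducible

n <- 10000
severity <- rnorm(n, mean = 50, sd = 10)   # simulated "true" scores

# MCAR: every score has the same 30% chance of being missing.
mcar <- severity
mcar[runif(n) < 0.3] <- NA

# MNAR: the higher the score, the more likely it is to be missing.
mnar <- severity
mnar[runif(n) < plogis((severity - 50) / 5)] <- NA

mean(severity)             # mean of the complete data
mean(mcar, na.rm = TRUE)   # close to the complete-data mean
mean(mnar, na.rm = TRUE)   # noticeably below the complete-data mean
```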

Dealing with Missing Data

  • Look for patterns!
    • Do observations with missing values have similar traits?
  • Consider outside explanations!
    • Why might missing data exist?
    • Should we have a “missing” category in our analysis?
  • Can we impute values?
    • If depression is MCAR within gender, age, and education level, then the distribution of depression will be similar for people of the same gender, age, and education level.
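
One very simple way to act on this idea, shown only as an illustration: replace each missing value with the mean of the observed values in the same group. The data, the column names, and the helper impute_group_mean() are all hypothetical, and filling in means like this makes the data look less variable than it really is, so treat it as a sketch of the logic rather than a recommended method.

```r
# Made-up survey data with one missing score per group.
survey <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  score = c(10, NA, 14, 20, 22, NA)
)

# Hypothetical helper: fill each NA with the mean of its own group.
impute_group_mean <- function(x, group) {
  stopifnot(is.numeric(x), length(x) == length(group))

  # ave() applies FUN separately within each level of `group`.
  ave(x, group, FUN = function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
}

survey$score_imputed <- impute_group_mean(survey$score, survey$group)
survey$score_imputed   # 10 12 14 20 22 21
```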

Lab 7: Functions + Fish

To do…

  • Lab 7: Functions + Fish
    • Due Saturday, 5/18 at 11:59pm.
  • Read Chapter 8: Functional Programming
    • Check-in 8.1 due Monday 5/20 at 10:00am.
  • Final Project Group Contract
    • Due Monday, 5/20 at 11:59pm.